EXL Decision Analytics Methodology Snapshot

We apply a set of highly effective tools, techniques and best practices for the end-to-end model development cycle

Stage 1: Preliminary Data Exploration
- Univariate Analysis (EDD*)
- Modeling and Validation Split
- Bivariate Analysis

Stage 2: Data Preparation
- Outlier Treatment
- Missing Imputation
- Roll Ups and Data Merge

These stages demand a lot of manual effort in analyzing and understanding each and every variable

Stage 3: Variable Creation
- Dummy Variable Creation
- Binning and Banding
- Transformations
- Interactions and Groupings

Stage 4: Variable Reduction
- Variable Clustering
- Inter-Correlation Analysis
- Variance Inflation Factor Test

These stages require business sense and out-of-the-box thinking for brainstorming on creating hypothesis-based variables and dropping redundant features

Stage 5
- Modeling Technique Selection
- Model Improvements
- Ensemble

Stage 6: Validation and Stabilization
- In-Sample Validation
- Out-of-Time Validation
- Bootstrapping
- Coefficient Blasting

These stages require good knowledge of statistical techniques for providing high-quality solutions

* EDD: Extended Data Dictionary

Objectives and Scope 



Course Goals 

To provide a structured overview of linear and logistic regression modeling concepts used during 
application of EXL DA methodology 

To introduce trainees to SAS syntax for implementation of traditional model development techniques 
To explain interpretation of key SAS output 

Hands on exercises on real life data to practice respective modeling steps during the training course 
To provide helpful “tricks of the trade” 

Beyond the Scope of this Training 

Comprehensive coaching on model building 

Derivation of statistical formulas or terms (unless required as part of methodology explanation) 

Extensions / Advanced Modeling Techniques (GLM, Multinomial Logistic Regression, Machine Learning Techniques)


Self Study Goals 


Linear and Logistic Regression model development practice on hypothetical data
In-depth research on advanced modeling concepts

Discussion on advanced concepts can be taken up offline 
Motivation Behind Predictive Modeling

Two Crucial Uses of Predictive Modeling

1. Prediction

It is important when the objective is to estimate a score for each record 
For example: Scorecard Development 


2. Explanation 

It is important when the objective is to identify and interpret the contributing predictors 
For example: Key Driver Analysis 



[Figure: scale showing Importance of Prediction, running from Low to High]

Chapter 1: Linear Regression
1.1 What is Regression?



1.1.1. Regression Analysis 

Meaning 

Study of statistical dependence of a target variable (also known as dependent variable) on one or more 
predictors (also known as independent variables) 

Objective 

To estimate and/or predict the mean value of the dependent variable on the basis of the known values of 
the independent variables 

Synonyms of Dependent and Independent Variables 

Dependent Variable is also known as ‘Target Variable’ or ‘Response Variable’ or ‘Regressand’ or ‘Outcome’

Independent Variable is also known as ‘Predictor’ or ‘Explanatory Variable’ or ‘Regressor’ or ‘Covariate’ 


Regression Equation

$Y = f(X_1, X_2, \ldots, X_k) + \varepsilon$

where $Y$ is the dependent variable, $X_1, X_2, \ldots, X_k$ are the k independent variables and $\varepsilon$ is the stochastic error term



Examples 

Average hourly wage depends on education and occupational domain (industry) 

Price of car depends on car weight, fuel efficiency and manufacturing place among other things 
1.1.2. Linear Regression


Usage 

Linear Regression technique may be used to study the relation between a dependent and one or more 
independent variables, when the dependent variable is continuous 

Simple Linear Regression vs. Multiple Linear Regression

Simple Linear Regression

Definition: Linear regression in which the dependent variable is related to a single explanatory variable
Equation: $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
Example: Personal consumption expenditure (Y) depends on disposable income ($X_1$)

Multiple Linear Regression

Definition: Linear regression in which the dependent variable is related to two or more explanatory variables
Equation: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon$
Example: Crop yield (Y) depends on temperature ($X_1$), rainfall ($X_2$), sunshine ($X_3$) and fertilizer ($X_4$)

1.2 Meaning of Linearity



The term ‘linear’ can be interpreted in two ways: 
Linearity in the Variables 
Linearity in the Parameters 





















Exercise 

Exercise 1. Which of the following are cases of Linear Regression Model? Further categorize them into Simple 
Linear Regression Models and Multiple Linear Regression Models. 

a. Default Amount = $\beta_1$ + $\beta_2$ (FICO Score) + $\beta_3$ (Income) + $\varepsilon$

b. CAT Score = + (# Attempts) + (Educational Background) + $\varepsilon$

c. Consumption = + (Disposable Income) + $\varepsilon$

d. Demand Price = + (Quantity Demanded) + $\varepsilon$

1.3 Stochastic Disturbance vs. Residual

1.3.1. Population Regression Function (PRF)

A linear PRF states that the expected value of the distribution of Y given $X_i$ is functionally related to $X_i$ such that it is linear in parameters

$E(Y \mid X_i) = \beta_0 + \beta_1 X_i$

Where $\beta_0$ and $\beta_1$ are unknown but fixed parameters known as the regression coefficients

Example: Consumption = $15 + 0.8 (Income)

Things to Remember

$\beta_0$ is known as intercept
$\beta_1$ is known as slope

Box 1 (Optional for Interested Readers): Population Regression Line

[Figure: population regression line $E(Y \mid X)$ plotted against income. The intercept is the sustenance level of consumption and the slope is 0.8, i.e. 80 cents of consumption per 1 dollar of income]

For every 1 dollar increase in income, average consumption expenditure of individuals in the given population increases by 80 cents

Stochastic Specification of PRF

An individual’s consumption expenditure (for a given income level) is the sum of two components:

$E(Y \mid X_i)$: Average consumption expenditure, which is the Systematic or Deterministic Component
$\varepsilon_i$: Non-systematic Component (known as Stochastic Disturbance or Random Error)

$Y_i = E(Y \mid X_i) + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Box 2 (Optional for Interested Readers)

Continuing with the illustration from Box 1, for a given level of income $X_i$, an individual’s consumption expenditure is clustered around the average consumption of all individuals at that $X_i$ (i.e. around its conditional expectation)

1.3.2. Significance of Stochastic Disturbance Term

Stochastic disturbance term $\varepsilon_i$ is a proxy for all those variables that are omitted from the model, but that collectively affect the dependent variable Y


Box 3 


Optional for Interested Readers 


Why do we need a stochastic disturbance term? Why don’t we use all variables affecting Y? 

Such variables may be unknown due to vagueness of theory (lack of knowledge about the exact hypothesis) 

Even if they are known, quantitative data may not be available 

At least some part of variation in Y may be purely due to intrinsic randomness in human behavior. Even quantitative data 
may not be sufficient to explain these variations 

To keep model equation reasonably simple, it makes sense to retain only significant and stable predictors and to let the 
random disturbance term represent all other variables 

Even if all relevant variables affecting Y are readily available and are retained in the model, the correct form of functional 
relationship between target Y and predictors may be unknown 


1.3.3. Sample Regression Function (SRF)

PRF is an idealized concept. In practice, one rarely has access to the entire population
In general, only a sample of observations from the population is used
Sample Regression Function (SRF) is used to estimate the PRF

SRF is expressed as: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

Where

$\hat{Y}_i$ = estimator of $E(Y \mid X_i)$
$\hat{\beta}_0$ = estimator of $\beta_0$
$\hat{\beta}_1$ = estimator of $\beta_1$

Stochastic Specification of SRF

SRF in stochastic form is expressed as: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$

Where

$\hat{\varepsilon}_i$ is the Residual term and can be regarded as an estimate of the stochastic disturbance $\varepsilon_i$

1.3.4. Graphical Representation: Stochastic Disturbance vs. Residual

[Figure: SRF and PRF lines plotted with Y on the vertical axis and X on the horizontal axis; the two lines cross at point A]

Things to Remember

Stochastic Disturbance: $\varepsilon_i = Y_i - E(Y \mid X_i)$
Residual: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$
SRF: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
PRF: $E(Y \mid X_i) = \beta_0 + \beta_1 X_i$

For any $X_i$ to the right of point A, SRF overestimates the true PRF
For any $X_i$ to the left of point A, SRF underestimates the true PRF
Such over-estimation and under-estimation is inevitable due to sampling fluctuations
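
The distinction between the disturbance and the residual can be made concrete with a small simulation. The sketch below is in Python rather than SAS, purely to make the arithmetic visible. It assumes the PRF from the consumption example (intercept 15, slope 0.8) and an arbitrary noise level of 5; the sample, the seed and the names `prf` and `srf` are illustrative assumptions, not part of the course material.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumed PRF, borrowed from the consumption example: E(Y | X) = 15 + 0.8 * X
BETA0, BETA1 = 15.0, 0.8


def prf(x):
    """Population regression function (known here only because we simulate)."""
    return BETA0 + BETA1 * x


# Simulate one sample: Y_i = E(Y | X_i) + disturbance_i
xs = [float(x) for x in range(10, 110, 10)]
disturbances = [random.gauss(0.0, 5.0) for _ in xs]
ys = [prf(x) + e for x, e in zip(xs, disturbances)]

# Estimate the SRF from the sample by ordinary least squares
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar


def srf(x):
    """Sample regression function estimated from this one sample."""
    return b0 + b1 * x


# Residual = actual Y minus the SRF prediction
residuals = [y - srf(x) for x, y in zip(xs, ys)]

for x, e, r in zip(xs, disturbances, residuals):
    gap = srf(x) - prf(x)  # how far the SRF sits above (+) or below (-) the PRF
    print(f"X={x:5.1f}  disturbance={e:8.3f}  residual={r:8.3f}  SRF-PRF={gap:8.3f}")
```

For every observation the disturbance and the residual differ by exactly the gap between the SRF and the PRF at that X, which is the over- or under-estimation described above.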
1.4 Estimation Method: OLS

1.4.1. Ordinary Least Square (OLS) Criterion

[Figure: scatter plot of actual values of Y for given values of X]

Things to Remember

Sum of errors is not minimized. Positive and negative errors offset each other. It is the absolute value of errors that matters.

Sum of absolute errors is not minimized. The magnitude of errors matters. By squaring errors, the error itself is used as a weight. In other words, more weight is given to bigger error terms. Hence, the sum of squares of errors is minimized.
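
The criterion can be illustrated with the closed-form OLS solution for a single predictor. This is a minimal Python sketch on made-up data; the function names `ols_fit`, `sum_errors` and `sum_sq_errors` are assumptions introduced for illustration.

```python
def ols_fit(xs, ys):
    """Closed-form OLS for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sum_errors(xs, ys, intercept, slope):
    """Plain sum of errors: positive and negative errors can cancel."""
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys))


def sum_sq_errors(xs, ys, intercept, slope):
    """Sum of squared errors: the quantity that OLS minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(xs, ys)
best = sum_sq_errors(xs, ys, b0, b1)

# A flat line through the mean of Y also has a zero sum of errors,
# yet it fits far worse. This is why the plain sum is not a usable criterion.
flat_intercept = sum(ys) / len(ys)
print("OLS line :", sum_errors(xs, ys, b0, b1), best)
print("Flat line:", sum_errors(xs, ys, flat_intercept, 0.0),
      sum_sq_errors(xs, ys, flat_intercept, 0.0))
```

Moving the intercept or the slope away from the OLS values in either direction increases the sum of squared errors.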
1.4.2. BLUE: Characteristics of OLS Estimator

An OLS estimator $\hat{\beta}_i$ is said to be the Best Linear Unbiased Estimator (BLUE) of $\beta_i$

Linear

The estimator is a linear function of the dependent variable Y in the regression model

Unbiased

Average or expected value of the estimator is equal to the true value: $E(\hat{\beta}_i) = \beta_i$

Best

The estimator has minimum variance in the class of all such linear unbiased estimators


1.4.3. A Note on Significance Tests 


Hypothesis tests are performed during model build to test significance. For example: 

F-test for overall significance of linear regression model 
t-test for individual variable significance in linear regression model 
Standard error of an estimate is an important component of test statistic 

Caution: Violation of any OLS assumption that affects the standard errors of the estimates may make the significance tests invalid

| June 30, 2015 | © 2015 ExlService Holdings, Inc.








Exercise 



Exercise 2. What does point O (in the graph below) signify? Should a modeler go ahead with linear regression 
model fit without any intermediate action? 


[Figure: scatter plot of Y against X with one point labeled O]

[Hint: Recall

1. Steps taken at data preparation stage

2. Objective of OLS method is to minimize sum of squares of errors]

1.5 Linear Regression: Key Assumptions

1.5.1. Assumption 1: Predictor X is non-stochastic

Interpretation

Values taken by the regressor X are considered fixed in repeated samples. That is, X is non-random

Violation Implication

No serious implication as long as predictor X and disturbance $\varepsilon$ are uncorrelated, which is yet another assumption of the Classical Linear Regression Model (refer to Assumption 6 in Section 1.5.6)


1.5.2. Assumption 2: Variability in X values 

Interpretation 

X values in a given sample must not all be the same 

Example: Suppose the modeling data corresponds to a particular year (say, 2012). The ‘year’ variable 
would take single unique value ‘2012’ for all records. Such a variable won’t add any value in making any 
prediction. 


Violation Implication

The coefficient of such a variable cannot be estimated

Things to Remember

From the list of predictors, drop all variables that take a single unique value
1.5.3. Assumption 3: Zero mean value of disturbance $\varepsilon_i$

Interpretation

Assumption that $E(\varepsilon_i \mid X_i) = 0$ implies that the positive $\varepsilon_i$ values cancel out the negative $\varepsilon_i$ values so that their average or mean effect on Y is zero

$E(\varepsilon_i \mid X_i) = 0$ also implies that $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$ (given that $Y_i = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon_i$)

Violation Implication

No impact on the properties of slope coefficients ($\beta_1, \beta_2, \ldots, \beta_k$)
If $E(\varepsilon_i \mid X_i)$ is a non-zero constant, we get a biased estimate of intercept $\beta_0$

Box 4 (Optional for Interested Readers)

Why do we get a biased estimate of intercept if mean value of disturbance is a non-zero constant?

Consider a k-variable linear regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon$

Assume $E(\varepsilon \mid X_1, X_2, \ldots, X_k) = \lambda$, where $\lambda$ is a constant

$E(Y \mid X_1, X_2, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \lambda$
$= (\beta_0 + \lambda) + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$
$= \alpha + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$

Evidently, $\alpha = (\beta_0 + \lambda)$ is a biased estimate of $\beta_0$
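
The argument in Box 4 can be checked numerically. The Python sketch below fits OLS twice on simulated data: once with zero-mean disturbances and once with the same disturbances shifted by a constant $\lambda$. The true parameters, the value of `LAMBDA`, the seed and the helper `ols_fit` are all assumptions made for illustration.

```python
import random


def ols_fit(xs, ys):
    """Closed-form OLS for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


random.seed(7)
BETA0, BETA1 = 15.0, 0.8   # assumed true parameters
LAMBDA = 4.0               # assumed non-zero mean of the disturbance

xs = [float(x) for x in range(1, 51)]
zero_mean = [random.gauss(0.0, 2.0) for _ in xs]

# Same sample twice: disturbances centred on 0, then centred on LAMBDA
ys_ok = [BETA0 + BETA1 * x + e for x, e in zip(xs, zero_mean)]
ys_shifted = [BETA0 + BETA1 * x + (e + LAMBDA) for x, e in zip(xs, zero_mean)]

a_ok, b_ok = ols_fit(xs, ys_ok)
a_shift, b_shift = ols_fit(xs, ys_shifted)

print("intercept shift:", a_shift - a_ok)   # equals LAMBDA up to rounding
print("slope shift    :", b_shift - b_ok)   # zero up to rounding
```

The constant is absorbed entirely by the intercept, while the slope estimate is unaffected.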
1.5.4. Assumption 4: Homoscedasticity

Interpretation

Given the value of X, the variance of disturbances $\varepsilon_i$ is the same for all observations

$\mathrm{Var}(\varepsilon_i \mid X_i) = \sigma^2$


Violation Implication 

Absence of homoscedasticity implies presence of heteroscedasticity. OLS estimates remain unbiased. 

But OLS estimates no longer remain efficient (i.e. there are alternative methods of estimation such as 
WLS with smaller standard errors) and hence significance tests may not be valid 







1.5.5. Assumption 5: No autocorrelation between disturbances

Interpretation

Given any two X values, $X_i$ and $X_j$ ($i \neq j$), the correlation between $\varepsilon_i$ and $\varepsilon_j$ ($i \neq j$) is zero

This assumption is more likely to get violated in case of time-series data. Usually, generalized least squares (GLS) models are used to tackle this problem

Violation Implication

OLS estimates remain unbiased

But OLS estimates no longer remain efficient and hence significance tests may not be valid

1.5.6. Assumption 6: Zero covariance between $\varepsilon$ and X

Interpretation

X and $\varepsilon$ are assumed to be uncorrelated, as the definition of PRF requires that X and $\varepsilon$ have separate (and additive) influence on Y

Violation Implication

OLS estimates not only become biased, but also inconsistent (i.e. as the sample size increases indefinitely, the estimators do not converge to their true population values)

1.5.7. Assumption 7: n > k + 1

Interpretation

Number of observations (n) must be greater than the number of parameters to be estimated (k + 1), where k = number of independent variables ($X_1, X_2, \ldots, X_k$)

Parameters to be estimated include k slope coefficients ($\beta_1, \beta_2, \ldots, \beta_k$) plus 1 intercept coefficient ($\beta_0$)

Violation Implication

Regression coefficients can’t be estimated

1.5.8. Assumption 8: No perfect multicollinearity

Interpretation

There are no perfect linear relationships among the explanatory variables

Violation Implication

Perfect Multicollinearity Case

- Coefficients are indeterminate and standard errors are not defined

High Multicollinearity Case

- Estimation of regression coefficients is possible, but standard errors tend to be large
- Individual variable contribution tends to be less precise as predictors are highly correlated
- Multicollinearity leads to model over-fitting. The overall measure of goodness of fit can be very high, but the t-ratio of one or more variables may be statistically insignificant.

Things to Remember

Inter-correlation analysis and VIF test are popular methods of detecting multicollinearity
1.5.9. Assumption 9: Normality of $\varepsilon_i$

Interpretation

$\varepsilon_i$ follow the normal distribution

Violation Implication

Estimates remain BLUE

But they are no longer asymptotically efficient (i.e. as sample size grows, estimates are not optimal)

Note: Assumptions 3, 4, 5 and 9 together imply that $\varepsilon_i \sim NID(0, \sigma^2)$, which means $\varepsilon_i$ is normally and independently distributed with mean 0 and constant variance $\sigma^2$

Box 5 (Optional for Interested Readers): Why the Normality Assumption?

Central Limit Theorem (CLT) provides the theoretical justification for the normality assumption

Recall from Section 1.3.2 that $\varepsilon_i$ represents the combined influence of a large number of independent variables that are not an explicit part of the regression model

Influence of such omitted or neglected variables is expected to be small and random

By the Central Limit Theorem (CLT), if there is a large number of independent and identically distributed random variables, then the distribution of their sum tends to a normal distribution as the number of such variables increases indefinitely

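1.6 SAS Implementation

1.6.1. REG Procedure: SAS Syntax

Below is the syntax for PROC REG with frequently used options

/* The PROC REG and MODEL lines are missing from the source; they are reconstructed here from the options discussed in this section */
PROC REG DATA = <input dataset> ;
MODEL <dependent variable> = <independent variables> / SELECTION = <selection method> SLE = <value> SLS = <value> VIF TOL STB COLLIN ;
OUTPUT OUT = <output dataset> P = <name of predicted value variable> ;
ODS OUTPUT ParameterEstimates = <parameter estimates output dataset> ;
QUIT ;

1 For an exhaustive list of options, refer to SAS OnlineDoc™: Chapter 55: The REG Procedure (http://www.math.wpi.edu/saspdf/stat/chap55.pdf)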
Selection Methods

NONE

The complete model specified in the MODEL statement is used to fit the model

FORWARD (i.e. Forward Selection)

This technique begins with no variable in the model and then the variables are added one by one to the model based on their F statistics

- For each independent variable, F statistics are computed (reflecting variable contribution to the model)
- Variable with the largest F statistic is added to the model if its p-value < SLE= value
- Process is repeated until there is no independent variable whose F statistic is significant at the SLE= value
- Once a variable is in the model, it stays

BACKWARD (i.e. Backward Elimination)

This technique begins with all variables in the model and then the variables are deleted one by one from the model based on their F statistics

- For each independent variable, F statistics are computed (reflecting variable contribution to the model)
- Variable with the smallest F statistic is deleted from the model if its p-value > SLS= value
- Process is repeated until all the variables in the model produce F statistics significant at the SLS= value
- Once a variable is removed from the model, it is never re-considered for inclusion

STEPWISE

This technique is similar to the FORWARD selection technique except that the variables already in the model do not necessarily stay there

- Variables are added one by one to the model and the F statistic for a variable to be added must be significant at the SLE= value
- Once a variable is added, stepwise method looks at all the variables in the model and deletes any variable that does not produce an F statistic significant at the SLS= value
- Variables are thus entered into and removed from the model in such a way that each forward selection step may be followed by one or more backward elimination steps
- Stepwise process terminates
  - if no further variable can be added to the model for the specified SLE criterion and no further variable can be deleted from the model for the specified SLS criterion;
  - or if the variable to be added to the model is the one just deleted from it

SLE and SLS Values

SLE: SLE refers to a variable’s significance level of entry into the model
SLS: SLS refers to a variable’s significance level of stay within the model

Commonly Used Values: 0.01, 0.05 and 0.10

As a rule of thumb, SLE= 0.05 and SLS= 0.05 are used in general

Low SLE, SLS Values <=> Highly significant variables are selected
                    <=> Fewer variables are selected
                    <=> Stricter approach for variable selection

Significance Level    Confidence Level
0.01                  99%
0.05                  95%
0.10                  90%
1.6.2. Output Interpretation

Output Interpretation ... Continued

Collinearity Diagnostics

Condition Index = $\sqrt{\text{Maximum Eigenvalue} / \text{Eigenvalue}}$

                                ------------------------------ Proportion of Variation ------------------------------
Eigenvalue  Condition Index     Intercept   NUM_TRAN_6M   NUM_COMPLAINTS   IND_SALES_GW_GT_10PCT   NUM_CUSTOMER_VISITS_1WEEK
3.98403     1.00000             0.00258     0.00370       0.00812          0.01310                 0.00341
0.70851     2.37131             0.00183     0.01083       0.18251          0.15859                 0.00016407
0.18948     4.58544             0.00181     0.28843       0.17261          0.77165                 0.01324
0.09064     6.62973             0.06329     0.69032       0.27650          0.05419                 0.22017
0.02735     12.07006            0.93049     0.00171       0.36026          0.00247                 0.76302

Each row is one principal component: 5 principal components based on 5 inputs (one intercept plus 4 predictors)

Proportion of Variation: proportion of the variance of the estimate accounted for by each principal component

A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables

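
The condition index column can be reproduced from the eigenvalue column. The Python sketch below uses only the figures shown in the Collinearity Diagnostics output; the 0.5 cut-off is the one quoted in the text, while the function names are assumptions introduced for illustration. What counts as a "high" condition index is left to the reader, since the text does not give a threshold.

```python
import math

# Eigenvalues as reported in the Collinearity Diagnostics output
eigenvalues = [3.98403, 0.70851, 0.18948, 0.09064, 0.02735]

# Variance proportions per component, in the column order of the output:
# Intercept, NUM_TRAN_6M, NUM_COMPLAINTS, IND_SALES_GW_GT_10PCT, NUM_CUSTOMER_VISITS_1WEEK
proportions = [
    [0.00258, 0.00370, 0.00812, 0.01310, 0.00341],
    [0.00183, 0.01083, 0.18251, 0.15859, 0.00016407],
    [0.00181, 0.28843, 0.17261, 0.77165, 0.01324],
    [0.06329, 0.69032, 0.27650, 0.05419, 0.22017],
    [0.93049, 0.00171, 0.36026, 0.00247, 0.76302],
]


def condition_indices(eigs):
    """Condition index = sqrt(maximum eigenvalue / eigenvalue)."""
    largest = max(eigs)
    return [math.sqrt(largest / e) for e in eigs]


def strong_contributions(rows, cutoff=0.5):
    """Count, per component, the variance proportions above the cutoff."""
    return [sum(1 for p in row if p > cutoff) for row in rows]


indices = condition_indices(eigenvalues)
counts = strong_contributions(proportions)

for eig, ci, k in zip(eigenvalues, indices, counts):
    print(f"eigenvalue={eig:.5f}  condition index={ci:8.4f}  proportions > 0.5: {k}")
```

The recomputed indices match the reported ones up to the rounding of the printed eigenvalues.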
Output Interpretation ... Continued

Parameter Estimates

Model: MODEL1    Dependent: SALES

Variable                    DF   Estimate     StdErr     tValue    Probt           StandardizedEst   Tolerance   VarianceInflation
Intercept                   1    1429.90184   26.21584   54.5434   0               0                             0
NUM_TRAN_6M                 1    0.02487      0.000706   35.2159   7.30232E-243    0.31941           0.64802     1.54315
NUM_COMPLAINTS              1    -164.99541   6.11000    -27.004   4.95145E-150    -0.24644          0.64011     1.56222
IND_SALES_GW_GT_10PCT       1    295.98996    13.79458   21.457    9.038663E-98    0.19856           0.62256     1.60628
NUM_CUSTOMER_VISITS_1WEEK   1    12.75678     0.36448    34.9997   3.23408E-240    0.32230           0.62868     1.59063

Variable labels

NUM_TRAN_6M: Number of transactions in last 6 months
NUM_COMPLAINTS: Number of complaints registered by customers in last month
IND_SALES_GW_GT_10PCT: Takes value 1 if sales growth in previous year is greater than 10%
NUM_CUSTOMER_VISITS_1WEEK: Number of customers visited store in last week

Low p-value (<0.05) for a variable indicates that the variable is significant

Tolerance = 1 / VIF

VIF values are low, indicating no issue of multicollinearity

Model Equation

Predicted Sales = 1429.90184
                + 0.02487 * NUM_TRAN_6M
                - 164.99541 * NUM_COMPLAINTS
                + 295.98996 * IND_SALES_GW_GT_10PCT
                + 12.75678 * NUM_CUSTOMER_VISITS_1WEEK

Things to Remember

VIF measures the inflation in the variance of the parameter estimate due to collinearity that exists among the predictors

Model Interpretation

Number of transactions in last 6 months, high sales growth in previous year and customer visits in last 1 week have positive impact on sales

Number of customer complaints in last month has negative impact on sales

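
The model equation and the Tolerance = 1 / VIF relationship can both be exercised with a few lines of Python. The coefficients and tolerances below are copied from the Parameter Estimates output; the store record passed to `predict_sales` is a hypothetical example, and the function name itself is an assumption made for illustration.

```python
# Coefficients copied from the Parameter Estimates output
INTERCEPT = 1429.90184
COEFFICIENTS = {
    "NUM_TRAN_6M": 0.02487,
    "NUM_COMPLAINTS": -164.99541,
    "IND_SALES_GW_GT_10PCT": 295.98996,
    "NUM_CUSTOMER_VISITS_1WEEK": 12.75678,
}

# Tolerance values copied from the same output
TOLERANCE = {
    "NUM_TRAN_6M": 0.64802,
    "NUM_COMPLAINTS": 0.64011,
    "IND_SALES_GW_GT_10PCT": 0.62256,
    "NUM_CUSTOMER_VISITS_1WEEK": 0.62868,
}


def predict_sales(record):
    """Apply the fitted model equation to one record (a dict of predictor values)."""
    return INTERCEPT + sum(coef * record[name] for name, coef in COEFFICIENTS.items())


def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1.0 / tolerance


# Hypothetical store, used only to show how the equation is applied
store = {
    "NUM_TRAN_6M": 1000,
    "NUM_COMPLAINTS": 2,
    "IND_SALES_GW_GT_10PCT": 1,
    "NUM_CUSTOMER_VISITS_1WEEK": 50,
}

print("Predicted sales:", round(predict_sales(store), 2))
for name, tol in TOLERANCE.items():
    print(f"{name}: VIF = {vif_from_tolerance(tol):.5f}")
```

The recomputed VIF values agree with the VarianceInflation column up to the rounding of the printed tolerances.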
Output Interpretation ... Continued

Variable Contribution Computation

(A) Variable                 (B) Estimate   (C) Standardized Estimate   (D) Abs. Std. Estimate   (E) Contribution
                                                                        D = ABS(C)               E = D / Σ(D)
NUM_TRAN_6M                  0.02487        0.31941                     0.31941                  29.4%
NUM_COMPLAINTS               -164.99541     -0.24644                    0.24644                  22.7%
IND_SALES_GW_GT_10PCT        295.98996      0.19856                     0.19856                  18.3%
NUM_CUSTOMER_VISITS_1WEEK    12.75678       0.3223                      0.3223                   29.7%
Total                                                                   Σ(D) = 1.08671           Σ(E) = 100.0%

Interpretation

Variable contribution is well distributed across all variables
Transaction volume and customer visits are the top predictors of a store’s monthly sales amount

Things to Remember

A standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor

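
The contribution column can be recomputed directly from the standardized estimates. The Python sketch below follows the two spreadsheet formulas shown above, D = ABS(C) and E = D / Σ(D); the function name `contributions` is an assumption introduced for illustration.

```python
# Standardized estimates (column C) copied from the Parameter Estimates output
standardized = {
    "NUM_TRAN_6M": 0.31941,
    "NUM_COMPLAINTS": -0.24644,
    "IND_SALES_GW_GT_10PCT": 0.19856,
    "NUM_CUSTOMER_VISITS_1WEEK": 0.3223,
}


def contributions(std_estimates):
    """Share of each variable in the total absolute standardized estimate."""
    absolute = {name: abs(value) for name, value in std_estimates.items()}  # column D
    total = sum(absolute.values())                                          # sum of D
    return {name: value / total for name, value in absolute.items()}        # column E


shares = contributions(standardized)
for name, share in shares.items():
    print(f"{name}: {share * 100:.1f}%")
```

The sign of a standardized estimate is dropped on purpose: contribution measures the strength of a predictor, not the direction of its effect.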
Exercise

Exercise 3. Credit Line Increase Model (Line Assignment Model)

A credit card issuing company has identified the list of charge card customers eligible for line increase. It wants to predict the amount of line increase for each card holder

Server : 172.16.70.31
Location : T:\IND004\sas training\methodology\module_4
Train Data : train_sample_1 (Number of Observations: 35,525)

#    Variable                   Type   Label
1    ID                         Num    Card-holder identification number
2    LI_AMT                     Num    Credit line increase amount
3    SPEND_ALL_CARDS_12M        Num    Spend on all cards in last 12 months
4    SPEND_CH_CARDS_3M          Num    Spend on charge cards in last 3 months
5    NUM_30DPD_3M_ANYACCT       Num    Number of months in which any charge account was 30 DPD or more in last quarter
6    IND_SPEND_GW_GT_20PCT      Num    Takes value 1 if spend growth in previous year is greater than 20%
7    IND_UTIL_CH_1M_GT_150PCT   Num    Takes value 1 if utilization on charge cards in last month is greater than 150%
8    IND_UTIL_CH_3M_GT_50PCT    Num    Takes value 1 if utilization on charge cards in last quarter is greater than 50%
9    FICO                       Num    FICO score of the card-holder
10   INCOME_GE_100K             Num    Takes value 1 if per annum income of card-holder is greater than or equal to 100K
11   IND_GRADE_A                Num    Takes value 1 if the card-holder belongs to high value customer group

Build a linear regression model (target variable: LI_AMT)

Try out selection methods ‘NONE’ and ‘BACKWARD’ and notice the difference in results
Chapter 2: Logistic Regression
2.1 What is Logistic Regression?



2.1.1. Logistic Regression 

Usage 

Logistic Regression is a type of regression technique used to study the relation between a dependent and 
one or more independent variables, when the dependent variable is categorical 

Type 

Logistic Regression is of two types: Binary and Multinomial

Multinomial Logistic Regression is further of two types:

Ordered Multinomial Logistic Regression (dependent variable has ordered categories)
Nominal Multinomial Logistic Regression (dependent variable has unordered categories)


1 Binary Logistic Regression is popularly referred to as ‘Logistic Regression’ and is the focus of this chapter 

2 Multinomial Logistic Regression is beyond the scope of this training module 
2.1.2. Why Logistic Regression?

What if OLS Linear Regression technique is used to model a BINARY Dependent Variable?

Linear Probability Model is defined as:

$p_i = \beta_0 + \beta_1 X_i$

where $p_i$ = probability of occurrence of event

Box 6 (Optional for Interested Readers): Linear Probability Model

Consider a Linear Regression Model:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\Rightarrow E(Y_i) = \beta_0 + \beta_1 X_i$ ...(1)

In case of binary dependent variable, $Y_i$ takes only two values: 0 and 1

$E(Y_i) = 1 \times \Pr(Y_i = 1) + 0 \times \Pr(Y_i = 0)$
$\Rightarrow E(Y_i) = \Pr(Y_i = 1)$
$\Rightarrow E(Y_i) = p_i$ [Let $\Pr(Y_i = 1) = p_i$] ...(2)

From (1) and (2), Linear Probability Model can be written as:

$p_i = \beta_0 + \beta_1 X_i$

Two key reasons why OLS Linear Regression does not work with a binary target

Technical Issue: Violation of Assumptions

A binary (i.e. dichotomous) dependent variable in a linear regression model violates assumptions of

- Homoscedasticity
- Normality of the Error Term

Fundamental Issue: Bounded Probabilities

Linear Probability Model: $p_i = \beta_0 + \beta_1 X_i$

- If X has no upper or lower bound, then for any non-zero value of $\beta_1$ there are values of X for which either $p_i > 1$ or $p_i < 0$
- This is contradictory, as the true values of probabilities should lie between 0 and 1
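
A short numerical illustration of the bounded-probabilities issue, written in Python. The coefficients `BETA0` and `BETA1` are hypothetical values chosen only so that the breach is easy to see; they do not come from any fitted model.

```python
# Hypothetical Linear Probability Model coefficients, for illustration only
BETA0, BETA1 = 0.10, 0.05


def lpm_probability(x):
    """'Probability' implied by the Linear Probability Model p = b0 + b1 * x."""
    return BETA0 + BETA1 * x


for x in (-10, 0, 10, 30):
    p = lpm_probability(x)
    status = "valid" if 0.0 <= p <= 1.0 else "outside [0, 1]"
    print(f"X={x:4d}  p={p:6.2f}  {status}")
```

With an unbounded X, a large enough or small enough value always pushes the fitted value past 1 or below 0, whatever non-zero slope is used.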
Solution to Bounded Probabilities

Step 1: Use Odds instead of Probability of Event

Odds is defined as:

$\text{Odds} = \dfrac{p_i}{1 - p_i} = \dfrac{\text{probability of event}}{\text{probability of non-event}}$

- As probability of event ranges from 0 to 1, odds ranges from 0 to $\infty$
- Transforming probabilities to odds removes the upper bound

Step 2: Take Natural Logarithm of Odds

Logistic Regression Model: $\log\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$

The left-hand side is called 'logit' or 'log-odds'. It ranges from $-\infty$ to $+\infty$

2.1.3. Sigmoid Function

Logistic Regression Model

$Z = \log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$

Probability of Event is therefore estimated from logit (‘model score’) by the following transformation:

$p = \dfrac{e^{Z}}{1 + e^{Z}} = \dfrac{1}{1 + e^{-Z}}$    (Sigmoid Function or Logistic Function)

where Z varies from $-\infty$ to $+\infty$ and p varies from 0 to 1

Sigmoid or Logistic Curve

- An ‘S’ shaped curve
- Shows an early exponential growth
- Slows to linear growth in the middle
- Approaches p = 1 with an exponentially decaying gap

[Figure: sigmoid curve with p on the vertical axis and Z on the horizontal axis, Z running from -6 to 6]
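
The chain from probability to odds to logit, and back through the sigmoid, can be sketched in a few lines of Python. The function names are assumptions introduced for illustration; the two-branch form of `sigmoid` is a common way to avoid overflow for large negative Z and is not something the course material prescribes.

```python
import math


def sigmoid(z):
    """p = 1 / (1 + e^(-z)), evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def odds(p):
    """Odds = probability of event / probability of non-event."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return math.log(odds(p))


for z in (-6, -4, -2, 0, 2, 4, 6):
    p = sigmoid(z)
    print(f"Z={z:3d}  p={p:.4f}  odds={odds(p):10.4f}  logit={logit(p):7.4f}")
```

Applying `logit` to the output of `sigmoid` returns the original Z, which is why the model can be written either on the log-odds scale or on the probability scale.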
2.2 Estimation Method: MLE

Maximum Likelihood Estimation (MLE)

1. Construct Likelihood Function, expressing the likelihood of observing values of dependent variable Y for all n observations

2. Create log likelihood function to simplify the equation

3. Choose values of $\beta$’s to maximize log likelihood function

Likelihood Function

$L = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}$

Log Likelihood Function

Taking Log of Likelihood Function

$\log L = \sum_{i=1}^{n} \left[ Y_i \log p_i + (1 - Y_i) \log(1 - p_i) \right]$

Derivation: See Appendix A.1

Substituting values from Sigmoid Function (Section 2.1.3)

$\log L = \sum_{i=1}^{n} Y_i Z_i - \sum_{i=1}^{n} \log\left(1 + e^{Z_i}\right)$
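
The three MLE steps can be sketched in Python on a tiny data set. Everything here is an assumption made for illustration: the six observations are hypothetical, and plain gradient ascent with a fixed step size stands in for whatever optimizer the statistical software actually uses.

```python
import math

# Hypothetical data, deliberately not perfectly separable
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def softplus(z):
    """log(1 + e^z), computed in a numerically stable way."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(b0, b1):
    """Simplified form: sum(Y_i * Z_i) - sum(log(1 + e^Z_i))."""
    return sum(y * (b0 + b1 * x) - softplus(b0 + b1 * x) for x, y in zip(xs, ys))


def log_likelihood_bernoulli(b0, b1):
    """Unsimplified form: sum(Y_i log p_i + (1 - Y_i) log(1 - p_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def gradient(b0, b1):
    """Partial derivatives of the log likelihood with respect to b0 and b1."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1


def fit(learning_rate=0.01, steps=50000):
    """Step 3: climb the log likelihood by gradient ascent from (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0, g1 = gradient(b0, b1)
        b0 += learning_rate * g0
        b1 += learning_rate * g1
    return b0, b1


b0_hat, b1_hat = fit()
print("estimates:", round(b0_hat, 4), round(b1_hat, 4))
print("log L at estimates:", round(log_likelihood(b0_hat, b1_hat), 4))
print("log L at (0, 0)   :", round(log_likelihood(0.0, 0.0), 4))
```

Both forms of the log likelihood return the same value for any coefficients, and the gradient is close to zero at the fitted values, which is the first-order condition for a maximum.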
2.3 Logistic Regression: Key Assumptions

2.3.1. Logistic Regression Assumptions

- Dependent variable has to be categorical (dichotomous for binary logistic regression)
- P(Y=1) is the probability of occurrence of event
  - Dependent variable is to be coded accordingly
- For a binary logistic regression, the class 1 of the dependent variable should represent the desired outcome
- Error terms need to be independent. Logistic regression requires each observation to be independent.
- Model should have little or no multicollinearity
- Logistic regression assumes linearity of independent variables and log odds
- Sample size should be large enough

Note: Maximum likelihood estimates are less powerful than ordinary least squares. As a rule of thumb, while OLS needs at least 5 cases per independent variable, ML needs at least 10 cases per independent variable. Some statisticians even recommend at least 30 cases for each parameter to be estimated in Logistic Regression.

2.3.2. Conditions not required for Logistic Regression

- Linear relationship between the dependent and independent variables is not necessary
- Error terms (residuals) do not need to be normally distributed
- Homoscedasticity is not needed
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2.4 Odds Ratio 



2.4.1. Definition 

Definition 1 

Odds ratio for a predictor is defined as the relative amount by which the odds of the outcome increase 
(Odds Ratio > 1) or decrease (Odds Ratio < 1) when the value of the predictor variable is increased by 1 
unit 


Odds Ratio for predictor x 1 


U-/V 

X 1= l 

( n \ 


p 


h-pj 

x,=o 


where p is the probability of occurrence of event 


Definition 2 

Odds ratio for a predictor is defined as the exponential of its estimated coefficient 

Odds Ratio for predictor x 1 = e Pl 

Proof: See Appendix A.2 


2.4.2. Interpretation

Interpretation of odds ratio depends on the type of predictor: binary or continuous

Odds Ratio > 1

$$\left.\left(\frac{p}{1-p}\right)\right|_{x_1=1} > \left.\left(\frac{p}{1-p}\right)\right|_{x_1=0}$$

    When X1 is binary:      Relative probability of event to non-event is higher when X1 is present vis-a-vis when X1 is absent
    When X1 is continuous:  Relative probability of event to non-event is higher when X1 increases by 1 unit

Odds Ratio < 1

$$\left.\left(\frac{p}{1-p}\right)\right|_{x_1=1} < \left.\left(\frac{p}{1-p}\right)\right|_{x_1=0}$$

    When X1 is binary:      Relative probability of event to non-event is lower when X1 is present vis-a-vis when X1 is absent
    When X1 is continuous:  Relative probability of event to non-event is lower when X1 increases by 1 unit

2.5 Frequently Encountered Problems



2.5.1. Complete Separation Problem

Meaning

Complete separation implies that there is some linear combination of the predictors that perfectly predicts the dependent variable

Illustration

step01_create_data.sas

    data outlib.comp_sep;
        input x y;
        datalines;
    1 0
    2 0
    3 0
    4 1
    5 1
    6 1
    ;
    run;

Whenever x > 3.5, y = 1
Whenever x < 3.5, y = 0
It is a case of complete separation

step02_run_logistic_regression.sas

    proc logistic data = outlib.comp_sep descending;
        model y = x;
    run;

comp_sep.sas7bdat

    Obs   X   Y
    1     1   0
    2     2   0
    3     3   0
    4     4   1
    5     5   1
    6     6   1

step02_run_logistic_regression.log

    WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.
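Why the maximum likelihood estimate does not exist can be seen by evaluating the log likelihood on the comp_sep data for ever steeper slopes. The Python sketch below is illustrative only: it assumes the linear predictor Z = b1 * (x - 3.5), i.e. the decision boundary is fixed at 3.5 and only the slope b1 varies.

```python
import math

# The comp_sep data from the illustration
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

def log_likelihood(b1):
    """Log likelihood for Z = b1 * (x - 3.5); the cut point 3.5 is an assumption."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = b1 * (xi - 3.5)
        total += yi * z - math.log1p(math.exp(z))
    return total

slopes = [1, 2, 5, 10, 20]
values = [log_likelihood(b) for b in slopes]
for b, v in zip(slopes, values):
    print(b, v)
```

The log likelihood keeps rising towards 0 as the slope grows, so no finite coefficient maximizes it; this is the situation that the SAS warning reports.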


2.5.2. Quasi-Complete Separation Problem

Meaning

Quasi-complete separation exists whenever there is complete separation except for at least one value of the predictor for which both values of the dependent variable occur

Illustration

step01_create_data.sas

    data outlib.quasi_sep;
        input x y;
        datalines;
    1 0
    2 0
    3 0
    4 0
    4 1
    5 1
    6 1
    ;
    run;

For x > 4, y = 1
For x < 4, y = 0
For x = 4, there exists one record with y = 0 and another with y = 1
It is a case of quasi-complete separation

quasi_sep.sas7bdat

    Obs   X   Y
    1     1   0
    2     2   0
    3     3   0
    4     4   0
    5     4   1
    6     5   1
    7     6   1

step02_run_logistic_regression.log

    WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist.

2.5.3. Remedies



Problem Detection

• Check warnings in log file
• Identify problematic variables

Tip: In general, categorical (particularly binary) predictors cause separation problems

• Check cross tab frequencies of categorical independent variables with the dependent variable
• Look out for cells with zero frequency
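The cross tab check can be sketched as follows. This is a Python illustration only (in SAS the same check would normally be a frequency cross tab); it treats x from the quasi_sep data as a categorical predictor, and the helper names crosstab and zero_cells are hypothetical.

```python
from collections import Counter

# The quasi_sep data, with x treated as a categorical predictor
x = [1, 2, 3, 4, 4, 5, 6]
y = [0, 0, 0, 0, 1, 1, 1]

def crosstab(predictor, target):
    """Frequency of each (predictor value, target value) pair."""
    return Counter(zip(predictor, target))

def zero_cells(predictor, target):
    """Cells of the cross tab in which no record falls."""
    counts = crosstab(predictor, target)
    return sorted(
        (xv, yv)
        for xv in set(predictor)
        for yv in set(target)
        if counts[(xv, yv)] == 0
    )

empty = zero_cells(x, y)
print(empty)
```

Every level of x except 4 has an empty cell, meaning that level predicts y perfectly; such levels are the candidates for omission or regrouping.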


Resolution

• Omit problematic variables (Recommended Solution)
• Redefine problematic variables (if it makes sense)

Illustration: Creating a new variable “x_new” from “x”, assuming that values 1 and 4 have similar meaning and can be clubbed together

quasi_sep.sas7bdat

    Obs   X   Y
    1     1   0
    2     2   0
    3     3   0
    4     4   0
    5     4   1
    6     5   1
    7     6   1

redefine_x.sas

    data new;
        set quasi_sep;
        x_new = x;
        if x = 4 then x_new = 1;
    run;

new.sas7bdat

    Obs   X   Y   x_new
    1     1   0   1
    2     2   0   2
    3     3   0   3
    4     4   0   1
    5     4   1   1
    6     5   1   5
    7     6   1   6

2.6 SAS Implementation


2.6.1. LOGISTIC Procedure: SAS Syntax

Below is the syntax for PROC LOGISTIC with frequently used options

    PROC LOGISTIC DATA = <train dataset> DESCENDING ;    Specify name of train dataset (DESCENDING as in the Section 2.5 examples)
        MODEL <dependent> = <regressors>
              / SELECTION = <selection method>           Specify variable selection method
                SLE = <SLE criterion>                    Specify significance level of entry and stay
                SLS = <SLS criterion>
                STB ;                                    This option displays standardized estimates
        OUTPUT OUT = <train predictions>                 Specify name of train scored output dataset
               P   = P_1 ;                               This option requests the score variable name. For example, specify P_1
        SCORE  DATA = <test dataset>                     Specify name of validation (test) dataset for scoring
               OUT  = <test predictions> ;               Specify name of test scored output dataset
        ODS OUTPUT ParameterEstimates = <parameter estimates output dataset> ;
    RUN ;


1 For an exhaustive list of options, refer to SAS OnlineDoc™: Chapter 39: The LOGISTIC Procedure (http://www.math.wpi.edu/saspdf/stat/chap39.pdf)
2.6.2. Output Interpretation

Illustration: Objective is to predict the probability of a student scoring more than 80% marks in the final exam

Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                    Intercept    Intercept and
    Criterion       Only         Covariates
    AIC             356.797      192.024
    SC              361.012      206.683
    -2 Log L        354.797      184.024

• Akaike Information Criterion (AIC) and Schwarz Criterion (SC) penalize for the number of predictors and can be used to compare different models. The models with smaller values are better.
• -2 Log Likelihood is the deviance statistic. The lower, the better.

Testing Global Null Hypothesis: BETA=0

    Test                Chi-Square    Pr > ChiSq
    Likelihood Ratio    170.7731      <.0001
    Score               145.3349      <.0001
    Wald                 62.9089      <.0001

• These are three tests to check overall significance of the model.
• Low p-values indicate that at least one of the predictors' regression coefficients is not equal to zero in the model, that is, the overall model is significant
NOTE: No (additional) effects met the 0.05 significance level for removal from the model.

Summary of Backward Elimination

            Effect              Number    Wald
    Step    Removed       DF    In        Chi-Square    Pr > ChiSq    Variable Label
    1       ATTENDANCE    1     3         3.1871        0.0742        Attendance (in percentage)

Variable ‘ATTENDANCE’ got eliminated due to high p-value (>0.05)

Parameter Estimates

Analysis of Maximum Likelihood Estimates

                                                 Standard    Wald                        Standardized
    Parameter                  DF    Estimate    Error       Chi-Square    Pr > ChiSq    Estimate
    Intercept                  1     -9.1153     1.4764      38.0165       <.0001
    AVG_MARKS_PREV_5_TESTS     1      0.0953     0.0175      29.4848       <.0001         1.3988
    IND_DEC_MARKS_PREV_TEST    1     -1.4206     0.3805      13.9375       0.0002        -0.3632
    IND_EXT_GUIDE              1      1.0746     0.4688       5.2547       0.0219         0.2728

Low p-value (<0.05) for a variable indicates that the variable is significant

Model Equation

Probability of Scoring More Than 80% Marks:

    P_1 = 1 / (1 + e^(-Z))

    where Z = - 9.1153
              + 0.0953 * AVG_MARKS_PREV_5_TESTS
              - 1.4206 * IND_DEC_MARKS_PREV_TEST
              + 1.0746 * IND_EXT_GUIDE

Model Interpretation

• Average marks of previous 5 tests and external guidance (tuition) have a positive impact on scoring > 80% in the final exam
• A declining trend in marks in the last test has a negative impact on scoring > 80% in the final exam
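To make the model equation concrete, the Python sketch below (illustrative only, not SAS) scores one student using the coefficients from the Parameter Estimates table. The student profile (average of 90, no decline in the last test, enrolled for external guidance) is hypothetical, and so is the function name predict_p1.

```python
import math

def predict_p1(avg_marks_prev_5_tests, ind_dec_marks_prev_test, ind_ext_guide):
    """P_1 = 1 / (1 + exp(-Z)) with the estimated coefficients from the output above."""
    z = (-9.1153
         + 0.0953 * avg_marks_prev_5_tests
         - 1.4206 * ind_dec_marks_prev_test
         + 1.0746 * ind_ext_guide)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical student: average 90 in previous 5 tests, no decline, takes tuition
p_no_decline = predict_p1(90, 0, 1)

# Same student, but with a decline in the most recent test
p_with_decline = predict_p1(90, 1, 1)

print(round(p_no_decline, 3), round(p_with_decline, 3))
```

The two calls differ only in the decline indicator, so the drop in predicted probability reflects the negative coefficient of IND_DEC_MARKS_PREV_TEST.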


Variable Contribution Computation

    A                          B           C               D           E                     F               G
                                           Standardized    Wald Chi    Abs. Std. Estimate    Contribution    Contribution
    Variable                   Estimate    Estimate        Square      E = ABS(C)            F = E / Σ(E)    G = D / Σ(D)
    AVG_MARKS_PREV_5_TESTS      0.0953      1.3988         29.4848     1.3988                68.7%           60.6%
    IND_DEC_MARKS_PREV_TEST    -1.4206     -0.3632         13.9375     0.3632                17.8%           28.6%
    IND_EXT_GUIDE               1.0746      0.2728          5.2547     0.2728                13.4%           10.8%
    Total                                                  48.6770     2.0348                100.0%          100.0%

Method 1 (column F): Based on Standardized Estimates
Method 2 (column G): Based on Wald Chi Sq. values

Interpretation

Avg. marks scored in previous 5 tests is the key driver for scoring 80% plus marks in final exam


1 Another way (Method 3) to compute variable contribution is to check the loss in log likelihood by removing one predictor at a time and refitting the model
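The contribution columns F and G can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only; it takes the standardized estimates and Wald chi-square values from the table and normalizes each column to 100%. The function name contribution is hypothetical.

```python
standardized_estimate = {
    "AVG_MARKS_PREV_5_TESTS": 1.3988,
    "IND_DEC_MARKS_PREV_TEST": -0.3632,
    "IND_EXT_GUIDE": 0.2728,
}
wald_chi_square = {
    "AVG_MARKS_PREV_5_TESTS": 29.4848,
    "IND_DEC_MARKS_PREV_TEST": 13.9375,
    "IND_EXT_GUIDE": 5.2547,
}

def contribution(values):
    """Share of each variable in the column total of absolute values, in percent."""
    total = sum(abs(v) for v in values.values())
    return {name: 100.0 * abs(v) / total for name, v in values.items()}

method_1 = contribution(standardized_estimate)  # column F
method_2 = contribution(wald_chi_square)        # column G

for name in standardized_estimate:
    print(name, round(method_1[name], 1), round(method_2[name], 1))
```

Both methods rank the three variables in the same order here, although the shares differ noticeably between them.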
Odds Ratio Estimates

                               Point       95% Wald
    Effect                     Estimate    Confidence Limits
    AVG_MARKS_PREV_5_TESTS     1.100       1.063    1.138
    IND_DEC_MARKS_PREV_TEST    0.242       0.115    0.509
    IND_EXT_GUIDE              2.929       1.169    7.341


The LOGISTIC Procedure

Association of Predicted Probabilities and Observed Responses

    Percent Concordant    92.5     Somers' D    0.861
    Percent Discordant     6.4     Gamma        0.871
    Percent Tied           1.1     Tau-a        0.174
    Pairs                 25251    c            0.931
Interpretation of Odds Ratio Estimates

• Odds of scoring more than 80% marks increase by 10% when average marks in previous 5 tests increase by 1 unit
• Students with a decline in score in the most recent test have 75.8% lower odds of scoring >80% marks than other students
• Students taking external guidance have 192.9% higher odds of scoring >80% marks than other students
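The odds ratio point estimates and the percentages quoted above follow directly from the parameter estimates (Definition 2 in Section 2.4.1). A short Python sketch, for illustration only:

```python
import math

# Estimated coefficients from the Analysis of Maximum Likelihood Estimates table
estimates = {
    "AVG_MARKS_PREV_5_TESTS": 0.0953,
    "IND_DEC_MARKS_PREV_TEST": -1.4206,
    "IND_EXT_GUIDE": 1.0746,
}

# Odds ratio = exponential of the estimated coefficient
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

# Percentage change in odds for a 1-unit increase (or presence vs absence)
pct_change_in_odds = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

for name in estimates:
    print(name, round(odds_ratios[name], 3), round(pct_change_in_odds[name], 1))
```

Note that these percentages describe changes in odds, not in probability; the change in probability depends on the values of the other predictors.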


The association statistics (Percent Concordant, Somers' D, Gamma, Tau-a and c) are a few important validation metrics, to be covered in the Validation training module











Exercise 



Exercise 4. VIF values for a Logistic Regression Model 

Continue with logistic regression model illustration where the objective is to predict the probability of scoring more 
than 80% marks in the final exam 
Server : 172.16.70.31 

Location : T:\IND004\sas training\methodology\module_4 

Train Data : train_sample_2 (Number of Observations: 500) 



    #    Variable                   Type    Label
    1    ROLLNO                     Num     Student roll number
    2    IND_MARKS_GT_80PCT         Num     Takes value 1 if a student scores more than 80% marks
    3    ATTENDANCE                 Num     Attendance (in percentage)
    4    AVG_MARKS_PREV_5_TESTS     Num     Average marks scored in previous 5 tests
    5    IND_DEC_MARKS_PREV_TEST    Num     Takes value 1 if there was a decline in score in the last test
    6    IND_EXT_GUIDE              Num     Takes value 1 if student enrolled for external guidance (tuition)


a. Using PROC LOGISTIC, build a Logistic Regression model (target variable: IND_MARKS_GT_80PCT) and 
tally your output with illustrative output in Section 2.6.2 

b. Report VIF values for final model variables 

[Hint: PROC LOGISTIC does not support VIF option. Use PROC REG for generating VIF values] 


Chapter 3: Model Improvements

3.1 Choice of Modeling Technique



3.1.1. Is Current Technique Appropriate?

DO'S

• DO look at the distribution of the dependent variable
• DO residual plot analysis

DON'TS

• DO NOT blindly apply OLS Linear Regression technique, just because the dependent variable is not categorical
• DO NOT apply OLS Linear Regression technique if the dependent variable is categorical (e.g. binary)
3.1.2. What are the Alternatives?¹

Count Data Models
    • Poisson
    • Negative Binomial
    • Zero Inflated
    Tools: SAS, R

Decision Trees
    • Classification Tree
    • Regression Tree
    Tools: CART, R, SAS E-Miner

Machine Learning
    • Neural Network
    • Bayesian Network
    • Support Vector Machines
    Tools: R

Survival Analysis
    • Kaplan Meier
    • Life Table
    • Cox Regression
    • Discrete Time Logistic
    Tools: SAS, R

Time Series Forecasting
    • Holt-Winters
    • ARIMA
    • ARCH
    • GARCH
    Tools: SAS, R

¹ This is not to be considered an exhaustive list of modeling scenarios and techniques
3.2 Variable Innovation



Variable creation in general should precede model development. However, in practice, they go hand-in-hand as model development is an iterative process.

• Create as many useful variables as possible
• Innovate to add value

Variable innovation may be triggered by hypothesis creation or automation need

Hypothesis Driven
    Variable Creation:    From monthly income and expense information, create income trend and expenditure trend variables
    Variable Innovation:  Create expense to income ratio and its trend variable

Automation Driven
    Variable Creation:    Example 1: Mathematical transforms like square, cube, square root, cube root, log and inverse
                          Example 2: Interaction (Variable 1 x Variable 2)
    Variable Innovation:  Example 1: Create all mathematical transforms and retain the best transform for each predictor
                          Example 2: To maximize coverage, create all possible two-way interactions from a given list of predictors and retain the ones that add value and can be interpreted


“I am thankful to all those who said no to me. It’s because of them I did it myself.”
- Albert Einstein

3.3 Oversampling



3.3.1. When, Why and How?

Oversampling is a technique to adjust the class distribution of the target variable

1. When
    • When the distribution of dependent variable categories is highly skewed
    • For instance, event rate < 5%

2. Why
    When event rate is low, the oversampling of events
    • Reduces the class bias between events and non-events
    • Can improve model performance

3. How
    Two Common Approaches
    • No change in non-events but increase the number of events by randomly replicating existing events
    • No change in events but downsize the number of non-events by random sampling of non-events

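The two common approaches can be sketched as follows. This Python example is illustrative only; the function names, the record layout and the scaled-down data (10 events, 200 non-events) are all hypothetical.

```python
import random

def oversample_events(records, target, n_events, seed=0):
    """Keep all non-events; replicate existing events at random until there are n_events."""
    rng = random.Random(seed)
    events = [r for r in records if r[target] == 1]
    non_events = [r for r in records if r[target] == 0]
    replicated = rng.choices(events, k=n_events - len(events))  # sampling with replacement
    return non_events + events + replicated

def undersample_non_events(records, target, n_non_events, seed=0):
    """Keep all events; draw a random sample of n_non_events non-events."""
    rng = random.Random(seed)
    events = [r for r in records if r[target] == 1]
    non_events = [r for r in records if r[target] == 0]
    return events + rng.sample(non_events, n_non_events)  # sampling without replacement

# Hypothetical data: 10 events and 200 non-events
records = [{"id": i, "y": 1} for i in range(10)] + [{"id": 10 + i, "y": 0} for i in range(200)]

up = oversample_events(records, "y", n_events=200)
down = undersample_non_events(records, "y", n_non_events=10)

print(len(up), sum(r["y"] for r in up))
print(len(down), sum(r["y"] for r in down))
```

Both results have a 50% sample event rate; the first keeps every original record and adds duplicates, while the second discards most non-events.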
3.3.2. Intercept Adjustment


Oversampling Implications

• Oversampling has no impact on slope coefficients and hence no impact on rank ordering
• Only the intercept term is to be adjusted to obtain correct probabilities

If β_0 is the intercept term,

$$\text{Corrected Intercept} = \beta_0 + \text{Offset}$$

$$\text{Offset} = -\log\left[\left(\frac{1-\lambda}{\lambda}\right)\left(\frac{\pi}{1-\pi}\right)\right]$$

where log = Natural Logarithm
      λ = True Event Rate
      π = Sample Event Rate

Tip: If the objective is only to identify top deciles, only rank ordering matters and therefore there is no need for intercept adjustment

Things to Remember

True Event Rate < Sample Event Rate => Offset < 0
True Event Rate > Sample Event Rate => Offset > 0


Illustration: Offset Calculation

Suppose 10K non-events are randomly selected from 200K non-events

                            Before Sampling    After Sampling
    Number of Events        10K                10K
    Number of Non-Events    200K               10K
    Event Rate              approx. 5% (True)  50% (Sample)

$$\text{Offset} = -\log\left[\left(\frac{1-0.05}{0.05}\right)\left(\frac{0.50}{1-0.50}\right)\right] = -\log(19) = -2.9444$$

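The offset calculation is easy to script. The Python sketch below is illustrative only; the function name intercept_offset is hypothetical, and the rates are those of the illustration (true event rate 5%, sample event rate 50%).

```python
import math

def intercept_offset(true_event_rate, sample_event_rate):
    """Offset = -log( ((1 - true) / true) * (sample / (1 - sample)) ), natural logarithm."""
    return -math.log(
        ((1.0 - true_event_rate) / true_event_rate)
        * (sample_event_rate / (1.0 - sample_event_rate))
    )

offset = intercept_offset(true_event_rate=0.05, sample_event_rate=0.50)
print(round(offset, 4))

# Sign rules from "Things to Remember"
offset_when_true_is_higher = intercept_offset(true_event_rate=0.50, sample_event_rate=0.05)
offset_when_rates_match = intercept_offset(true_event_rate=0.20, sample_event_rate=0.20)
```

Adding this offset to the intercept fitted on the oversampled data shifts every predicted log odds by the same amount, which is why rank ordering is unaffected.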
3.4 Ensemble



3.4.1. What is Ensemble?

• Ensemble means combining several models into one prediction
• An ensemble model typically works better than the best individual model component of that ensemble
• Ensemble technique is most effective when individual model components are diverse
• Diversity in individual model components can be attained through
    • Usage of diverse modeling techniques to build individual models
    • Usage of diverse variables in individual models

3.4.2. Common Ensemble Methods


List of Common Ensemble Methods

Majority Voting
    • Simple Majority Voting
    • Weighted Majority Voting

Algebraic Combiners
    • Min Rule
    • Max Rule
    • Product Rule
    • Sum Rule
    • Median Rule
    • Mean Rule
    • Weighted Average Rule

Advanced Methods¹
    • Boosting
    • Bootstrap Aggregation (BAGGING)
    • Random Forest

¹ Advanced methods are beyond the scope of this training module
3.4.3. Algebraic Combiners and Majority Voting Illustration

Assume target variable has two classes C1 and C2 and there are 5 models to be considered for ensemble

                                  Model 1    Model 2    Model 3    Model 4    Model 5
    Model Weights                 0.25       0.20       0.10       0.15       0.30
    Predicted Probability (C1)    0.85       0.30       0.20       0.60       0.40
    Predicted Probability (C2)    0.15       0.70       0.80       0.40       0.60

    Ensemble Rule               Class: C1                                          Class: C2
    Min Rule                    MIN(0.85,0.30,0.20,0.60,0.40) = 0.20               MIN(0.15,0.70,0.80,0.40,0.60) = 0.15
    Max Rule                    MAX(0.85,0.30,0.20,0.60,0.40) = 0.85               MAX(0.15,0.70,0.80,0.40,0.60) = 0.80
    Product Rule                PRODUCT(0.85,0.30,0.20,0.60,0.40) = 0.012          PRODUCT(0.15,0.70,0.80,0.40,0.60) = 0.020
    Sum Rule                    SUM(0.85,0.30,0.20,0.60,0.40) = 2.35               SUM(0.15,0.70,0.80,0.40,0.60) = 2.65
    Median Rule                 MEDIAN(0.85,0.30,0.20,0.60,0.40) = 0.40            MEDIAN(0.15,0.70,0.80,0.40,0.60) = 0.60
    Mean Rule                   AVERAGE(0.85,0.30,0.20,0.60,0.40) = 0.47           AVERAGE(0.15,0.70,0.80,0.40,0.60) = 0.53
    Weighted Average Rule       25%(0.85) + 20%(0.30) + 10%(0.20)                  25%(0.15) + 20%(0.70) + 10%(0.80)
                                + 15%(0.60) + 30%(0.40) = 0.5025                   + 15%(0.40) + 30%(0.60) = 0.4975
    Simple Majority Voting      2 Votes (Given by Models 1 and 4)                  3 Votes (Given by Models 2, 3 and 5)
    Weighted Majority Voting    Sum of Weights of Models 1 and 4 = 0.40            Sum of Weights of Models 2, 3 and 5 = 0.60
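The rules in the illustration can be reproduced with a short script. The Python sketch below is illustrative only; the function names algebraic_combiners and majority_votes are hypothetical, and each model is assumed to vote for the class to which it assigns the highest probability.

```python
import math
import statistics

weights = [0.25, 0.20, 0.10, 0.15, 0.30]  # Models 1 to 5
probs = {
    "C1": [0.85, 0.30, 0.20, 0.60, 0.40],
    "C2": [0.15, 0.70, 0.80, 0.40, 0.60],
}

def algebraic_combiners(p, w):
    """Apply each algebraic combiner to one class's predicted probabilities."""
    return {
        "min": min(p),
        "max": max(p),
        "product": math.prod(p),
        "sum": sum(p),
        "median": statistics.median(p),
        "mean": statistics.mean(p),
        "weighted_average": sum(wi * pi for wi, pi in zip(w, p)),
    }

def majority_votes(class_probs, w):
    """Each model votes for its most probable class; return vote counts and summed weights."""
    classes = list(class_probs)
    simple = {c: 0 for c in classes}
    weighted = {c: 0.0 for c in classes}
    for m in range(len(w)):
        winner = max(classes, key=lambda c: class_probs[c][m])
        simple[winner] += 1
        weighted[winner] += w[m]
    return simple, weighted

combined = {c: algebraic_combiners(p, weights) for c, p in probs.items()}
simple_votes, weighted_votes = majority_votes(probs, weights)

for c in probs:
    print(c, {rule: round(value, 4) for rule, value in combined[c].items()})
print(simple_votes, {c: round(v, 2) for c, v in weighted_votes.items()})
```

Replacing weights and probs with the values from Exercise 5 (and adding a "C3" entry) gives a way to tally the exercise answers.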
Exercise



Exercise 5. Target variable has three categories: C1, C2 and C3. Ensemble the following 5 models using Algebraic Combiners and Majority Voting techniques.

                                  Model 1    Model 2    Model 3    Model 4    Model 5
    Model Weights                 0.30       0.25       0.20       0.10       0.15
    Predicted Probability (C1)    0.85       0.30       0.20       0.10       0.10
    Predicted Probability (C2)    0.01       0.50       0.60       0.70       0.10
    Predicted Probability (C3)    0.14       0.20       0.20       0.20       0.80

Note: Use the following table to tally answers

    Approach               Ensemble Rule               Class: C1    Class: C2    Class: C3
    Algebraic Combiners    Min Rule                    0.10         0.01         0.14
                           Max Rule                    0.85         0.70         0.80
                           Product Rule                0.00051      0.00021      0.00090
                           Sum Rule                    1.55         1.91         1.54
                           Median Rule                 0.20         0.50         0.20
                           Mean Rule                   0.310        0.382        0.308
                           Weighted Average Rule       0.395        0.333        0.272
    Majority Voting        Simple Majority Voting      1 Vote       3 Votes      1 Vote
                           Weighted Majority Voting    0.30         0.55         0.15

3.5 Segmentation



3.5.1. Need for Segmentation

Different portions of data may be driven by different factors
    • Variables A and B may be the key drivers of Segment 1
    • Variables X, Y and Z may be more relevant for Segment 2

Possibility of interaction between a binary predictor and other independent variables
    • A key predictor with binary values makes a case for different patterns across the two classes
    • Segmented models are one way of capturing multiple interactions

Segmentation strategies may boost model performance
    • Segmented models can be combined
    • Lift of logistic regression models and RMSE of linear regression models show reasonable improvements in most cases

3.5.2. Segmentation Strategies

Business Sense
    • When the modeler has a fair idea of general patterns at a high level and/or has the required business sense for the purpose of practical application

Flipping Correlation Sign
    • When the sign of a predictor's correlation coefficient with the dependent variable flips across two subsets of the entire modeling population

Dominant Binary Contributor
    • When there is an extremely high contribution of a binary variable in the base model

Patterns in Error Term
    • When it is possible to identify some patterns in the error terms of the base model

A.1 Logistic Regression Likelihood Function



Derivation

Likelihood function expresses the likelihood of observing values of dependent variable Y for all n observations

$$
\begin{aligned}
L &= \Pr(Y_1, Y_2, \ldots, Y_n) \\
  &= \Pr(Y_1)\Pr(Y_2)\cdots\Pr(Y_n) && \text{(because observations are assumed to be independent of each other)} \\
  &= \prod_{i=1}^{n} \Pr(Y_i) && \text{(where } \textstyle\prod \text{ indicates repeated multiplication)} \\
  &= \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{(1-Y_i)} && \text{(since } \Pr(Y_i=1) = p_i,\ \Pr(Y_i=0) = 1-p_i \Rightarrow \Pr(Y_i) = p_i^{Y_i}(1-p_i)^{(1-Y_i)}\text{)} \\
  &= \prod_{i=1}^{n} \left(\frac{p_i}{1-p_i}\right)^{Y_i}(1-p_i)
\end{aligned}
$$

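A.2 Odds Ratio Proof



Proof

Consider a k-variable logistic regression model:

$$Z = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$$

where p is the probability of occurrence of event

$$
\begin{aligned}
\log(\text{Odds Ratio for predictor } x_1)
  &= \log\left[\frac{\left.\left(\dfrac{p}{1-p}\right)\right|_{x_1=1}}{\left.\left(\dfrac{p}{1-p}\right)\right|_{x_1=0}}\right] && \text{[Refer to Definition 1 in Section 2.4.1]} \\
  &= \left.\log\left(\frac{p}{1-p}\right)\right|_{x_1=1} - \left.\log\left(\frac{p}{1-p}\right)\right|_{x_1=0} \\
  &= \left.(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k)\right|_{x_1=1} - \left.(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k)\right|_{x_1=0} \\
  &= (\beta_0 + \beta_1 + \beta_2 x_2 + \ldots + \beta_k x_k) - (\beta_0 + \beta_2 x_2 + \ldots + \beta_k x_k) \\
  &= \beta_1
\end{aligned}
$$

$$\Rightarrow \text{Odds Ratio for predictor } x_1 = e^{\beta_1}$$
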
Thanks

For queries, contact Varun Aggarwal at Varun.Aggarwal@exlservice.com